ABSTRACT


In  this  paper  three  models  of  parallel  speedup  are  studied.  They  are  fixed-size  speedup, 
fixed-time  speedup  and  memory-bounded  speedup.  The  latter  two  consider  the  relationship 
between  speedup  and  problem  scalability.  Two  sets  of  speedup  formulations  are  derived 
for  these  three  models.  One  set  considers  uneven  workload  allocation  and  communication 
overhead,  and  gives  more  accurate  estimation.  Another  set  considers  a  simplified  case  and 
provides  a  clear  picture  on  the  impact  of  the  sequential  portion  of  an  application  on  the 
possible performance gain from parallel processing. The simplified fixed-size speedup is
Amdahl’s law. The simplified fixed-time speedup is Gustafson’s scaled speedup. The simplified
memory-bounded speedup contains both Amdahl’s law and Gustafson’s scaled speedup as
special cases. This study leads to a better understanding of parallel processing.
1 Introduction


Although  parallel  processing  has  become  a  common  approach  for  achieving  high  performance,  there 
is  no  well-established  metric  to  measure  the  performance  gain  of  parallel  processing.  The  most 
commonly  used  performance  metric  for  parallel  processing  is  speedup,  which  gives  the  performance 
gain  of  parallel  processing  versus  sequential  processing.  Traditionally,  speedup  is  defined  as  the 
ratio  of  uniprocessor  execution  time  to  execution  time  on  a  parallel  processor.  There  are  different 
ways  to  define  the  metric  “execution  time”.  In  fixed-size  speedup,  the  amount  of  work  to  be  executed 
is independent of the number of processors. Based on this model, Ware [17] summarized Amdahl’s
[1] arguments to define a speedup formula which is known as Amdahl’s law. However, in many
applications,  the  amount  of  work  to  be  performed  increases  (as  the  number  of  processors  increases) 
in  order  to  obtain  a  more  accurate  or  better  result.  The  concept  of  scaled  speedup  was  proposed  by 
Gustafson  et  al.  at  Sandia  National  Laboratory  [6].  Based  on  this  concept,  Gustafson  suggested  a 
fixed-time  speedup  [5],  which  fixes  the  execution  time  and  is  interested  in  how  the  problem  size  can 
be  scaled  up.  In  scaled  speedup,  both  sequential  and  parallel  execution  times  are  measured  based 
on  the  same  amount  of  work  defined  by  the  scaled  problem. 

Both  Amdahl’s  law  and  Gustafson’s  scaled  speedup  use  a  single  parameter,  the  sequential 
portion  of  a  parallel  algorithm,  to  characterize  an  application.  They  are  simple  and  give  much 
insight  into  the  potential  degradation  of  parallelism  as  more  processors  become  available.  Amdahl’s 
law  has  a  fixed  problem  size  and  is  interested  in  how  small  the  response  time  could  be.  It  suggests 
that  massively  parallel  processing  may  not  gain  high  speedup.  Gustafson  [5]  approaches  the  problem 
from  another  point  of  view.  He  fixes  the  response  time  and  is  interested  in  how  large  a  problem 
could  be  solved  within  this  time.  This  paper  further  investigates  the  scalability  of  problems.  While 
Gustafson’s  scalable  problems  are  constrained  by  the  execution  time,  the  capacity  of  main  memory 
is  also  a  critical  metric.  For  parallel  computers,  especially  for  distributed-memory  multiprocessors, 
the  size  of  scalable  problems  is  often  determined  by  the  memory  available.  Shortage  of  memory  is 
paid  for  in  problem  solution  time  (due  to  the  I/O  or  message-passing  delays)  and  in  programmer 
time  (due  to  the  additional  coding  required  to  multiplex  the  distributed  memory)  [3].  For  many 
applications, the amount of memory is an important constraint to scaling problem size [6, 10].
Thus, memory-bounded speedup is the major focus of this paper.

We first study three models of speedup: fixed-size speedup, fixed-time speedup, and
memory-bounded speedup. With both uneven workload allocation and communication overhead considered,
speedup formulations will be derived for all three models. When communication overhead is not
considered and the workload only consists of sequential and perfectly parallel portions, the simplified
fixed-size speedup is Amdahl’s law, the simplified fixed-time speedup is Gustafson’s scaled speedup,
and the simplified memory-bounded speedup contains both Amdahl’s law and Gustafson’s speedup
as special cases. Therefore, the three models of speedup, which represent different points of view,
are unified.

Based  on  the  concept  of  scaled  speedup,  intensive  research  has  been  conducted  in  recent  years 
in  the  area  of  performance  evaluation.  Some  other  definitions  of  speedup  have  also  been  proposed, 
such  as  generalized  speedup,  cost-related  speedup,  and  superlinear  speedup.  Interested  readers  can 
refer  to  [14,  9,  16,  7,  18,  2,  8]  for  details. 

This  paper  is  organized  as  follows.  In  Section  2  we  introduce  the  program  model  and  some 
basic  terminologies.  More  generalized  speedup  formulations  for  the  three  models  of  speedup  are 
presented  in  Section  3.  Speedup  formulations  for  simplified  cases  are  studied  in  Section  4.  The 
influence  of  communication/memory  tradeoff  is  studied  in  Section  5.  Conclusions  and  comments 
are  given  in  Section  6. 

2  A  Model  of  Parallel  Speedup 

To  measure  different  speedup  metrics  for  scalable  problems,  the  underlying  machine  is  assumed  to 
be a scalable multiprocessor. A multiprocessor is considered scalable if, as the number of processors
increases, the memory capacity and network bandwidth also increase. Furthermore, all processors
are assumed to be homogeneous. Most distributed-memory multiprocessors and multicomputers,
such as commercial hypercube and mesh-connected computers, are scalable multiprocessors. Both
message-passing and shared-memory programming paradigms have been used in such multiprocessors.
To simplify the discussion, our study assumes homogeneous distributed-memory architectures.

The  parallelism  in  an  application  can  be  characterized  in  different  ways  for  different  purposes 
[15].  For  simplicity,  speedup  formulations  generally  use  very  few  parameters  and  consider  very  high 
level  characterizations  of  the  parallelism.  We  consider  two  main  degradations  of  parallelism,  uneven 
allocation  (load  imbalance)  and  communication  latency.  The  former  degradation  is  application 
dependent.  The  latter  degradation  depends  on  both  the  application  and  the  parallel  computer 
under  consideration.  To  obtain  an  accurate  estimate,  both  degradations  need  to  be  considered. 
Uneven  allocation  is  measured  by  degree  of  parallelism. 

Definition  1  The  degree  of  parallelism  of  a  program  is  an  integer  which  indicates  the  maximum 
number  of  processors  that  can  be  busy  computing  at  a  particular  instant  in  time,  given  an  unbounded 
number  of  available  processors. 

The  degree  of  parallelism  is  a  function  of  time.  By  drawing  the  degree  of  parallelism  over  the 
execution  time  of  an  application,  a  graph  can  be  obtained.  We  refer  to  this  graph  as  the  parallelism 
profile. Figure 1 is the parallelism profile of a hypothetical divide-and-conquer computation [13].
By accumulating the time spent at each degree of parallelism, the profile can be rearranged to form
the shape (see Figure 2) of the application [12].

Figure 1. Parallelism profile of an application.
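
The rearrangement of a parallelism profile into a shape can be sketched in a few lines of Python. This is only an illustration: the segment list, the function names, and the unit computing capacity are assumptions made for the example, not data from the paper. The work at each degree of parallelism is recovered from the relation that work at degree i keeps i processors busy for the accumulated time.

```python
from collections import defaultdict

def shape_from_profile(profile):
    """Accumulate the time spent at each degree of parallelism.

    `profile` is a list of (duration, degree) segments in execution order,
    measured with an unbounded number of processors.
    Returns {degree: total_time}, ordered by degree.
    """
    shape = defaultdict(float)
    for duration, degree in profile:
        shape[degree] += duration
    return dict(sorted(shape.items()))

def work_by_degree(shape, capacity=1.0):
    """Work W_i = i * capacity * (time spent at degree i)."""
    return {i: i * capacity * t for i, t in shape.items()}

# Hypothetical divide-and-conquer profile: fan out 1 -> 2 -> 4, then fan in.
profile = [(1.0, 1), (1.0, 2), (2.0, 4), (1.0, 2), (1.0, 1)]
shape = shape_from_profile(profile)
work = work_by_degree(shape)
print(shape)
print(work)
```

The two segments at degree 1 and the two at degree 2 are merged, which is exactly the step that turns the time-ordered profile into the shape.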

Let $W$ be the amount of work of an application. Work can be defined as arithmetic operations,
instructions, or whatever is needed to complete the application. Formally, the speedup with $N$
processors and with the total amount of work $W$ is defined as

$$S_N(W) = \frac{T_1(W)}{T_N(W)} \qquad (1)$$

where $T_i(W)$ is the time required to complete $W$ amount of work on $i$ processors. Let $W_i$ be
the amount of work executed with degree of parallelism $i$, and let $m$ be the maximum degree of
parallelism. Thus, $W = \sum_{i=1}^{m} W_i$. Assuming each computation takes a constant time to finish on
a given processor, the execution time for computing $W_i$ with a single processor is

$$t_1(W_i) = \frac{W_i}{\Delta}, \qquad (2)$$

where $\Delta$ is the computing capacity of each processor. If there are $i$ processors available, the
execution time is

$$t_i(W_i) = \frac{W_i}{i\Delta}.$$

With an infinite number of processors available, the execution time will not be further decreased
and is

$$t_\infty(W_i) = \frac{W_i}{i\Delta} \quad \text{for } 1 \le i \le m.$$

Figure 2. Shape of the application.

Therefore, without considering communication latency, the execution times on a single processor
and on an infinite number of processors are

$$T_1(W) = \sum_{i=1}^{m} \frac{W_i}{\Delta} \qquad (3)$$

$$T_\infty(W) = \sum_{i=1}^{m} \frac{W_i}{i\Delta} \qquad (4)$$

The maximum speedup, with work $W$ and an infinite number of processors, is

$$S_\infty(W) = \frac{T_1(W)}{T_\infty(W)} = \frac{\sum_{i=1}^{m} W_i}{\sum_{i=1}^{m} W_i/i}. \qquad (5)$$

Average parallelism is an important factor for speedup and efficiency. It has been carefully
examined in [4]. Average parallelism is equivalent to the maximum speedup $S_\infty$ [4, 15]. $S_\infty$
gives the best possible speedup based on the inherent parallelism of an algorithm. There are no
machine dependent factors considered. With only a limited number of available processors and
with communication latency considered, the speedup will be less than the best speedup, $S_\infty(W)$.
If there are $N$ processors available and $N < i$, then some processors have to do
$\frac{W_i}{i}\left\lceil\frac{i}{N}\right\rceil$ work and the rest of the processors will do
$\frac{W_i}{i}\left\lfloor\frac{i}{N}\right\rfloor$ work. By the definition of degree of parallelism, $W_i$ and
$W_j$ cannot be executed simultaneously for $i \ne j$. Thus, the elapsed time will be

$$t_N(W_i) = \frac{W_i}{i\Delta}\left\lceil\frac{i}{N}\right\rceil.$$

Hence,

$$T_N(W) = \sum_{i=1}^{m} \frac{W_i}{i\Delta}\left\lceil\frac{i}{N}\right\rceil \qquad (6)$$

and the speedup is

$$S_N(W) = \frac{T_1(W)}{T_N(W)} = \frac{\sum_{i=1}^{m} W_i}{\sum_{i=1}^{m} \frac{W_i}{i}\left\lceil\frac{i}{N}\right\rceil}. \qquad (7)$$

Communication latency is another factor causing performance degradation. Unlike degree of
parallelism, communication latency is machine dependent. It depends on the communication
network topology, the routing scheme, the adopted switching technique, and the dynamics of the
network traffic. Let $Q_N(W)$ be the communication overhead when $N$ processors are used to
complete $W$ amount of work. The actual formulation for $Q_N(W)$ is difficult to derive, as it is dependent
on the communication pattern and the message sizes of the algorithm itself, as well as the
system-dependent communication latency. Note that $Q_N(W)$ is encountered when there are $N$ processors
($N > 1$). Assuming that the degree of parallelism does not change due to communication overhead,
the speedup becomes

$$S_N(W) = \frac{T_1(W)}{T_N(W)} = \frac{\sum_{i=1}^{m} W_i}{\sum_{i=1}^{m} \frac{W_i}{i}\left\lceil\frac{i}{N}\right\rceil + Q_N(W)}. \qquad (8)$$
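
The fixed-size speedup with uneven allocation and communication overhead can be evaluated directly from a table of work per degree of parallelism. The sketch below assumes that work at degree i takes (W_i / i) * ceil(i / N) work units on N processors and that the overhead Q_N(W) is expressed in the same units; the function names and the sample workload are hypothetical.

```python
def fixed_size_speedup(work, n_procs, comm_overhead=0.0):
    """General fixed-size speedup for a workload {degree i: W_i}.

    Denominator: sum over i of (W_i / i) * ceil(i / N), plus the overhead Q_N.
    """
    total = sum(work.values())
    elapsed = 0.0
    for degree, w in work.items():
        rounds = (degree + n_procs - 1) // n_procs   # integer ceil(i / N)
        elapsed += w / degree * rounds
    return total / (elapsed + comm_overhead)

def max_speedup(work):
    """Speedup with unlimited processors, i.e. the average parallelism."""
    return sum(work.values()) / sum(w / i for i, w in work.items())

# Hypothetical workload: 2 units at degree 1, 4 at degree 2, 8 at degree 4.
work = {1: 2.0, 2: 4.0, 4: 8.0}
for n in (1, 2, 3, 4, 8):
    print(n, fixed_size_speedup(work, n))
print("max", max_speedup(work))
```

With three processors the degree-4 work still needs two rounds, so the speedup is the same as with two processors; this is the load-imbalance effect captured by the ceiling term.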

3  Speedup  of  Scaled  Problems 

In  the  last  section  we  developed  a  general  speedup  formula  and  showed  how  the  number  of  processors 
and  degradation  parameters  influence  the  performance.  However,  speedup  is  not  dependent  only 
on  these  parameters.  It  is  also  dependent  on  how  we  view  the  problem.  With  different  points 
of  view,  we  get  different  models  of  speedup  and  different  speedup  formulations.  One  viewpoint 
emphasizes  shortening  the  time  it  takes  to  solve  a  problem  by  parallel  processing.  With  more  and 
more  computation  power  available,  the  problem  can,  in  principle,  be  solved  in  less  and  less  time. 
With  more  processors  available,  the  system  will  provide  a  fast  turnaround  time  and  the  user  will 
have  a  shorter  waiting  time.  A  speedup  formulation  based  on  this  philosophy  is  called  fixed-size 
speedup.  In  the  previous  section,  we  implicitly  adopted  fixed-size  speedup.  Eq.  (8)  is  the  speedup 
formula for fixed-size speedup. Fixed-size speedup is suitable for many algorithms in which the
problem size cannot be scaled.

For  some  applications  we  may  have  a  time  limitation,  but  we  may  not  want  to  obtain  the  solution 
in  the  shortest  possible  time.  If  we  have  more  computation  power,  we  may  want  to  increase  the 
problem  size,  carry  out  more  operations,  and  get  a  more  accurate  solution.  Various  finite  difference 
and  finite  element  algorithms  for  the  solution  of  Partial  Differential  Equations  (PDE’s)  are  typical 
examples  of  such  scalable  problems. 

An important issue in scalable problems is the identification of scalability constraints. One
scalability constraint is to keep the execution time unchanged with respect to uniprocessor execution
time. This viewpoint leads to a different model of speedup, called fixed-time speedup. For fixed-time
speedup the workload is scaled up with the number of processors available. Let
$W' = \sum_{i=1}^{m'} W_i'$ be the total amount of scaled work, where $W_i'$ is the amount of scaled work
executed with degree of parallelism $i$, and $m'$ be the maximum degree of parallelism of the scaled
problem when $N$ processors are available. Note that the maximum degree of parallelism can change as
the problem is scaled. In order to keep the same turnaround time as the sequential version, the
condition $T_1(W) = T_N(W')$ must be satisfied for $W'$. That is, the following scalable constraint
must be satisfied.

$$\sum_{i=1}^{m} W_i = \sum_{i=1}^{m'} \frac{W_i'}{i}\left\lceil\frac{i}{N}\right\rceil + Q_N(W'). \qquad (9)$$

Thus, the general speedup formula for fixed-time speedup is

$$S_N'(W') = \frac{T_1(W')}{T_N(W')} = \frac{\sum_{i=1}^{m'} W_i'}{\sum_{i=1}^{m'} \frac{W_i'}{i}\left\lceil\frac{i}{N}\right\rceil + Q_N(W')} = \frac{\sum_{i=1}^{m'} W_i'}{\sum_{i=1}^{m} W_i}. \qquad (10)$$

In  many  parallel  computers,  the  memory  size  plays  an  important  role  in  performance.  Many 
large  scale  multiprocessors  with  local  memory  architecture  do  not  support  virtual  memory  due  to 
insufficient  I/O  network  bandwidth.  When  solving  an  application  with  one  processor,  the  problem 
size  is  more  often  bounded  by  the  memory  limitation  than  by  the  execution  time  limitation.  With 
more  processors  available,  instead  of  keeping  the  execution  time  fixed,  we  may  want  to  meet  the 
memory  size  constraint.  In  other  words,  if  you  have  adequate  memory  space  and  the  scaled  problem 
meets  the  time  limit  imposed  by  fixed-time  speedup,  will  you  further  increase  the  problem  size  to 
yield  an  even  better  or  more  accurate  solution?  If  the  answer  is  yes,  the  appropriate  model  is 
memory-bounded speedup. Like fixed-time speedup, memory-bounded speedup is a scaled speedup.
The  problem  size  scales  up  with  memory  size.  The  difference  is  that  in  fixed-time  speedup  execution 
time  is  the  limiting  factor  and  in  memory-bounded  speedup  memory  size  is  the  limiting  factor. 

With memory size considered as a factor of performance, the requirements of an algorithm
consist of two parts. One is the computation requirement, which is the workload, and the other
is the memory (capacity) requirement. For a given algorithm, these two requirements are related
to each other, and the workload might be viewed as a function of the memory requirement. Let
$M$ represent the memory size of each processor. Let $g$ be a function such that $W = g(M)$, or
$M = g^{-1}(W)$, where $g^{-1}$ is the inverse function of $g$. An example of function $g$ and $g^{-1}$ can
be found in Section 5. In a homogeneous, scalable, parallel computer, the memory capacity on
each node is fixed and the total memory available increases linearly with the number of processors
available. If $W = \sum_{i=1}^{m} W_i$ is the workload for execution on a single processor, the maximum scaled
workload with $N$ processors, $W^* = \sum_{i=1}^{m^*} W_i^*$, must satisfy the following scalable constraint,

$$W^* = g(NM) = g(N g^{-1}(W)), \qquad (11)$$

where $m^*$ is the maximum degree of parallelism of the scaled problem and $g$ is determined by
the algorithm. The memory limitation can be stated as: the memory requirement for any active
processor is less than or equal to $M = g^{-1}\left(\sum_{i=1}^{m} W_i\right)$. Here the main point is that the memory
occupied on each processor is limited. By considering the communication overhead, Eq. (12) is the
general speedup formula for memory-bounded speedup.

$$S_N^*(W^*) = \frac{\sum_{i=1}^{m^*} W_i^*}{\sum_{i=1}^{m^*} \frac{W_i^*}{i}\left\lceil\frac{i}{N}\right\rceil + Q_N(W^*)}. \qquad (12)$$

4  Simplified  Models  of  Speedup 

The  three  general  speedup  formulations  contain  both  uneven  allocation  and  communication  latency 
degradations.  They  give  better  upper  bounds  on  the  performance  of  parallel  applications.  On 
the  other  hand,  these  formulations  are  problem  dependent  and  difficult  to  understand.  They  give 
detailed  information  for  each  application,  but  lose  the  global  view  of  possible  performance  gains.  In 
this  section,  we  make  some  simplifying  assumptions.  We  assume  that  the  communication  overhead 
is negligible, i.e., $Q_N = 0$, and the workload only contains two parts, a sequential part and a
perfectly parallel part. That is, $W_i = 0$ for $i \ne 1$ and $i \ne N$. We also assume that the sequential
part is independent of the system size, i.e., $W_1 = W_1' = W_1^*$.

Under this simplified case, the general fixed-size speedup formulation (Eq. 8) becomes

$$S_N(W) = \frac{W_1 + W_N}{W_1 + W_N/N}. \qquad (13)$$

Eq. (13) is known as Amdahl’s law. Figure 3 shows that when the number of processors increases
the load on each processor decreases. Eventually, the sequential part will dominate the performance
and the speedup is bounded by $(W_1 + W_N)/W_1$. In Figure 3, $T_1$ is the execution time for the sequential
portion of the work, and $T_N$ is the execution time for the parallel portion of the work.

Figure 3. Amdahl’s law. [Two panels; horizontal axis: Number of Processors (N).]


For fixed-time speedup and under the simplified conditions, the scalability constraint (Eq. 9)
becomes

$$W_1 + W_N = W_1' + \frac{W_N'}{N}. \qquad (14)$$

Since $W_1 = W_1'$, we have $W_N = W_N'/N$. That is, $W_N' = NW_N$. Eq. (10) becomes

$$S_N'(W') = \frac{W_1' + W_N'}{W_1 + W_N} = \frac{W_1 + NW_N}{W_1 + W_N}. \qquad (15)$$

The simplified fixed-time speedup (Eq. 15) is known as Gustafson’s scaled speedup [5]. From
Eq. (15) we can see that the parallel portion of an application scales up linearly with the system
size. The relation of workload and elapsed time for Gustafson’s scaled speedup is depicted in Figure
4.
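
The fixed-time scaling step can be checked numerically. The following sketch assumes the simplified model (sequential work W_1, perfectly parallel work W_N, no communication overhead); the function name and the sample values are hypothetical.

```python
def fixed_time_scaling(w1, wn, n_procs):
    """Scale the parallel work so that N processors need the uniprocessor time.

    Returns (scaled parallel work, fixed-time speedup).
    """
    wn_scaled = n_procs * wn                      # scaled parallel work = N * W_N
    time_single = w1 + wn                         # original problem on 1 processor
    time_parallel = w1 + wn_scaled / n_procs      # scaled problem on N processors
    # The scalability constraint: both runs take the same time.
    assert abs(time_single - time_parallel) < 1e-9
    speedup = (w1 + wn_scaled) / (w1 + wn)
    return wn_scaled, speedup

# Hypothetical workload with a 30% sequential portion.
wn_scaled, speedup = fixed_time_scaling(w1=3.0, wn=7.0, n_procs=8)
print(wn_scaled, speedup)
```

The parallel work grows linearly with the number of processors while the elapsed time stays fixed, which is why the resulting speedup is linear in N.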

We  need  some  preparation  before  deriving  the  simplified  formulation  for  memory-bounded 
speedup. 


Definition 2 A function $g$ is a semihomomorphism if there exists a function $\bar{g}$ such that for any
real number $c$ and any variable $x$, $g(cx) = \bar{g}(c)g(x)$.

Figure 4. Gustafson’s scaled speedup. [Two panels; horizontal axis: Number of Processors (N).]

One class of semihomomorphisms is the power function $g(x) = x^b$, where $b$ is a rational number.
In this case, $\bar{g}$ is the same as the function $g$. Another class of semihomomorphisms is the single
term polynomial $g(x) = ax^b$, where $a$ is a real constant and $b$ is a rational number. For this kind
of semihomomorphism, $\bar{g}(x) = x^b$, which is not the same as $g(x)$.
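
The semihomomorphism property of a single term polynomial can be verified numerically. This is a small sketch with hypothetical names and sample values; it checks that g(cx) equals g_bar(c) * g(x) when g(x) = a * x**b and g_bar(c) = c**b.

```python
import math

def make_single_term(a, b):
    """Return (g, g_bar) for g(x) = a * x**b and g_bar(c) = c**b."""
    def g(x):
        return a * x ** b

    def g_bar(c):
        return c ** b

    return g, g_bar

g, g_bar = make_single_term(a=2.0, b=1.5)
checks = [
    math.isclose(g(c * x), g_bar(c) * g(x), rel_tol=1e-12)
    for c in (1.0, 2.0, 4.0, 9.0)
    for x in (1.0, 3.0, 16.0)
]
print(all(checks))
```

Note that the constant a cancels out of g_bar, which is why g_bar differs from g whenever a is not 1.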

Under our assumptions, the sequential portion of the workload, $W_1$, is independent of the system
size. If the influence of memory on the sequential portion is not considered, i.e., the memory capacity
$M$ is used for the parallel portion only, we have the following theorem.

Theorem 1 If $W = g(M)$ for some semihomomorphism $g$, $g(cx) = \bar{g}(c)g(x)$, then, with all data
being accessible by all available processors and using all available memory space, the simplified
memory-bounded speedup is

$$S_N^*(W^*) = \frac{W_1 + \bar{g}(N)W_N}{W_1 + \bar{g}(N)W_N/N}. \qquad (16)$$

Proof: Assume that the maximum problem size will take the maximum available memory
capacity of $M$ when one processor is used. As mentioned before, when one processor is available,
the parallel portion of the workload, $W_N$, can be expressed as $W_N = g(M)$. Since all data are
accessible by all processors, there is no need to replicate the data. With $N$ processors available,
the total available memory capacity will be increased to $NM$. The parallel portion of the problem
can be scaled up to use all available memory capacity $NM$. Thus, the scaled parallel portion, $W_N^*$,
is expressed as $W_N^* = g(NM) = \bar{g}(N)g(M)$. Therefore, $W_N^* = \bar{g}(N)W_N$ and

$$S_N^*(W^*) = \frac{W_1^* + W_N^*}{W_1^* + W_N^*/N} = \frac{W_1 + \bar{g}(N)W_N}{W_1 + \bar{g}(N)W_N/N}. \qquad (17)$$

□


Note that in Theorem 1, we made two assumptions in the simplified case: 1) Since the
communication latency is ignored, remote memory accesses take the same time as local memory accesses.
This implies that the data is accessible by all available processors, and 2) All the available memory
space is used for a better solution. These simplified speedup models are useful to demonstrate how
the sequential portion of an application, $W_1$, will affect the maximum speedup that can be achieved
with different numbers of processors. Let $k = \frac{W_1}{W_1 + W_N}$. The simplified fixed-size speedup, fixed-time
speedup, and memory-bounded speedup are, respectively,

$$S_N(W) = \frac{1}{k + (1-k)/N} = \frac{N}{1 + k(N-1)}, \qquad (18)$$

$$S_N'(W') = N - k(N-1) = k + N(1-k), \quad \text{and} \qquad (19)$$

$$S_N^*(W^*) = \frac{N\left[k + (1-k)\bar{g}(N)\right]}{\bar{g}(N) + k\left(N - \bar{g}(N)\right)}. \qquad (20)$$

When the number of processors, $N$, goes to infinity, Eq. (18) is bounded by the reciprocal of
$k$, which gives the maximum value of the fixed-size speedup. Eq. (19) shows that the fixed-time
speedup is a linear function of the number of processors with slope equal to $(1 - k)$. When $N$
goes to infinity, this speedup can increase without bound. Memory-bounded speedup depends on
the function $\bar{g}(N)$. When $\bar{g}(N) = 1$, memory-bounded speedup is the same as fixed-size speedup.
When $\bar{g}(N) = N$, the memory-bounded speedup is the same as the fixed-time speedup. In general,
the function $\bar{g}(N)$ is application dependent and $\bar{g}(N) > N$. It implies that when the problem size
is increased by $N$, the amount of work increases more than $N$ times. It is easy to verify that
$S_N^*(W^*) > S_N'(W')$ when $\bar{g}(N) > N$. Note that all data in memory is likely to be accessed at least
once. Thus, for scaled problems, $\bar{g}(N) < N$ is unlikely to occur. The sequential portion of the
work plays different roles in the three definitions of speedup. In fixed-size speedup, the influence
of the sequential portion increases with system size and eventually dominates the performance. In
fixed-time speedup, the influence of the sequential portion is unchanged, which makes the speedup
a linear function of system size. In the memory-bounded speedup, since in general $\bar{g}(N) > N$,
the influence of the sequential portion is reduced when the system size increases, indicating that a
better speedup could be achieved with a larger system size.

The function $\bar{g}(N)$ provides a metric to evaluate parallel algorithms. In general, $\bar{g}(N)$ may not
be derivable for a given algorithm. Note that any single term polynomial is a semihomomorphism,
and most solvable algorithms have polynomial time computation and memory requirement. If we
take an algorithm’s computation and storage complexity (the term with the largest power) as its
computation and memory requirement, for any algorithm with polynomial complexity there exists a
semihomomorphism $g$, such that $W = g(M)$. The approximated semihomomorphism $g$ will provide
a good estimation of the memory-bounded speedup when the number of processors is large. More
detailed case studies for the three models of speedup can be found in [13].

Figure 5 demonstrates the difference between the three models of speedup when $k = 0.3$ and $N$
ranges from 1 to 1024. For the simplified memory-bounded (SMB) speedup, we choose $\bar{g}(N) = N^{3/2}$,
which is typical in many matrix operations to be described later. When $\bar{g}(N) = N$, it is Gustafson’s
scaled speedup. The case of $G(N) = \left(\frac{3N}{N+2}\right)^{3/2}$ will be studied in the next section.


Figure  5.  Amdahl’s  law,  Gustafson’s  speedup,  and  SMB  speedup  for  k  =  0.3. 
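
The comparison in Figure 5 can be reproduced approximately with a short script. The sketch below uses k as the sequential fraction of the original workload and a scaling factor of N to the power 3/2 for the memory-bounded case, as stated in the text; the function names are hypothetical and the printed table is only an illustration, not the paper's data.

```python
def fixed_size(k, n):
    """Amdahl's law with sequential fraction k."""
    return n / (1 + k * (n - 1))

def fixed_time(k, n):
    """Gustafson's scaled speedup with sequential fraction k."""
    return k + n * (1 - k)

def memory_bounded(k, n, g_bar):
    """Simplified memory-bounded speedup; g_bar(n) scales the parallel work."""
    scale = g_bar(n)
    return (k + (1 - k) * scale) / (k + (1 - k) * scale / n)

k = 0.3
for n in (1, 4, 16, 64, 256, 1024):
    smb = memory_bounded(k, n, lambda p: p ** 1.5)
    print(f"N={n:5d}  fixed-size={fixed_size(k, n):7.2f}  "
          f"fixed-time={fixed_time(k, n):8.2f}  SMB={smb:8.2f}")
```

Passing a constant scaling of 1 reproduces the fixed-size curve and passing a linear scaling reproduces the fixed-time curve, which is the unification the simplified model expresses.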


5 Communication-Memory Tradeoff

The simplified speedup formulations give the impact of the sequential portion of an application
on the maximum speedup. The simplified memory-bounded speedup suggests that when data are
shared by all processors, maximum speedup is obtained. However, in practice if communication
overhead is considered, the data sharing approach may not lead to maximum speedup. In the
design of efficient parallel algorithms, the communication cost plays an important role in deciding
how a problem should be solved and scaled. One way to reduce the frequency of communication
is to replicate some shared data to processors. Thus, a good algorithm design should consider the
tradeoff between the maximum size to which a problem can scale and the reduction of available memory
due to the replication of shared data.

If data replication is allowed, the function $W^* = g(NM)$ will no longer hold. Motivated by
Theorem 1, the function $G(N) = W_N^*/W_N$ is defined to represent the ratio of work increment
when $N$ processors are available. In terms of $G(N)$, the simplified memory-bounded speedup is
generalized below.

Theorem 2 If $W_1$ is independent of system size, $W_i = 0$ for $1 < i < N$, and $W_N^* = G(N)W_N$ for
some function $G(N)$, the memory-bounded speedup is

$$S_N^*(W^*) = \frac{W_1 + G(N)W_N}{W_1 + \frac{G(N)}{N}W_N + Q_N(W^*)}. \qquad (21)$$

The proof of Theorem 2 is similar to the proof of Theorem 1. Eq. (21) shows that the
maximum speedup is not necessarily achieved when $G(N) = \bar{g}(N)$. Note that the communication cost
$Q_N(W^*)$ is a unified communication cost. An optimal choice of the function $G(N)$ is both
algorithm and architecture dependent and, in general, is difficult to obtain. Also, unlike $\bar{g}(N)$, $G(N)$
might be less than $N$. If $G(N) < N$, memory capacity is likely to be the scalable constraint when
$N$ is large. If $G(N) > N$, execution time is likely to be the scalable constraint. The function $G(N)$
indicates the possible scalable constraint of an algorithm. The proposed scaled speedup (Eq. 21)
may not be easy to fully understand at first glance. Hence, we use matrix multiplication as an
example to illustrate it.
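
Before turning to that example, the effect of the communication term can be seen with a small numerical sketch. It assumes the memory-bounded speedup has the form (W_1 + G W_N) / (W_1 + G W_N / N + Q), with the overhead Q given in work units; the workload values and the overhead of 20 units are hypothetical and chosen only to show the tradeoff.

```python
def memory_bounded_speedup(w1, wn, n_procs, work_ratio, comm=0.0):
    """Memory-bounded speedup for a work ratio G(N) and overhead Q_N (work units)."""
    scaled = work_ratio * wn
    return (w1 + scaled) / (w1 + scaled / n_procs + comm)

w1, wn, n_procs = 3.0, 7.0, 16

# Replicated data: work grows only linearly, but no communication is needed.
replicated = memory_bounded_speedup(w1, wn, n_procs, work_ratio=n_procs)

# Shared data: work grows as N**1.5, first with free communication ...
shared_free = memory_bounded_speedup(w1, wn, n_procs, work_ratio=n_procs ** 1.5)

# ... then with a hypothetical communication overhead of 20 work units.
shared_costly = memory_bounded_speedup(
    w1, wn, n_procs, work_ratio=n_procs ** 1.5, comm=20.0)

print(replicated, shared_free, shared_costly)
```

With free communication the shared-data scaling wins, but a sufficiently large overhead reverses the order, which is why the largest work ratio does not necessarily give the best speedup.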

A  matrix  often  represents  some  discretized  continuum.  Enlarging  the  matrix  size  will  generally 
lead  to  a  more  accurate  solution  for  the  continuum.  For  matrix  multiplication  C  =  AB,  there  are 
many  ways  to  partition  the  matrices  A  and  B  to  allow  parallel  processing  [11].  Assume  that  there 
are $N$ processors available, and $A$ and $B$ are $n \times n$ matrices when executing on a single processor.
The computation requirement is $2n^3$ and the memory requirement is roughly $3n^2$. Thus, $W_N = 2n^3$
and $M = 3n^2$. Two extreme cases of memory-bounded scaled speedup are considered.

Local  Computation 

In the first case, we assume that the communication cost is extremely high. Thus, data should be
replicated if possible to reduce communication. This can be achieved by partitioning the columns
of matrix $B$ into $N$ submatrices, $B_0, B_1, \ldots, B_{N-1}$, and replicating the matrix $A$. Thus, the $B_i$’s are
distributed among all the processors and matrix $A$ is replicated on each processor. Processor $i$
does the multiplication $AB_i = C_i$, $i = 0, \ldots, N-1$, independently. Since there is no need for
communication, it is referred to as the local computation approach. Figure 6(a) shows the partitioning
of $B$ for the case of $N = 4$.


(a)  The  matrix  B  is  partitioned. 


Figure  6.  Two  partitioning  schemes  of  matrices  A  and  B. 


If both $A$ and $B$ are allowed to scale along any dimension and $A$ and $B$ are not required to be
square matrices, the enlarged problem is $A^*B^* = C^*$, where $A^*$ is an $\ell \times k$ matrix, $B^*$ is a $k \times m$
matrix, and the resulting matrix $C^*$ is an $\ell \times m$ matrix. Note that the local memory capacity
is $M = 3n^2$. It is easy to see that the maximum memory-bounded speedup will be achieved when
$\ell = k = n$, and $m = nN$. In other words, both $B$ and $C$ are scaled up $N$ times along their rows, and
$A$ is replicated but not scaled. The amount of computation on each processor is fixed, $W_N = 2n^3$,
and $W_N^* = NW_N$. Thus, we have $G(N) = N$. The memory-bounded scaled speedup is

$$S_N^*(W^*) = \frac{W_1 + NW_N}{W_1 + W_N},$$


which is Gustafson’s scaled speedup. Thus, the best performance of memory-bounded speedup
using the local computation model is the same as Gustafson’s scaled speedup. In general, the
local computation model will lead to a speedup that is less than Gustafson’s scaled speedup. For
example, if both $A$ and $B$ are restricted to square matrices, the function $G(N)$ will be

$$G(N) = \left(\frac{3N}{N+2}\right)^{3/2},$$

which is less than $N$, for $N > 1$, and is bounded by $3^{3/2}$ (see Appendix). Note that due to data
replication, the memory capacity requirement increases faster than the computation requirement
does.
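
The work ratio of the local computation approach can be estimated from simple memory accounting. The sketch below is an illustration under stated assumptions, not the derivation in the Appendix: each processor is assumed to hold all of A plus one column block each of B and C, within a local memory of 3n^2 words. For the square case this gives s^2 (1 + 2/N) = 3n^2 for a scaled side length s, an expression derived here rather than quoted from the paper. Function names are hypothetical.

```python
def local_general_ratio(n, n_procs):
    """Work ratio G(N) when A (n x n) is replicated and B, C grow to n x (n*N)."""
    scaled_work = 2 * n * n * (n * n_procs)   # 2 * l * k * m with l = k = n, m = n*N
    return scaled_work / (2 * n ** 3)

def local_square_ratio(n_procs):
    """Work ratio G(N) when the scaled matrices must stay square (derived here).

    Per-processor memory: s^2 for A plus 2 * s^2 / N for the blocks of B and C,
    set equal to 3 * n^2, so (s / n)^2 = 3N / (N + 2) and G(N) = (s / n)^3.
    """
    return (3 * n_procs / (n_procs + 2)) ** 1.5

for n_procs in (1, 2, 4, 16, 1024):
    print(n_procs, local_general_ratio(100, n_procs), local_square_ratio(n_procs))
```

Under these assumptions the square-matrix ratio saturates near 3 to the power 3/2 (about 5.2), because the replicated copy of A consumes a fixed share of every local memory.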


Global  Computation 

In  the  second  extreme  case,  we  assume  that  the  communication  cost  is  negligible.  Thus,  there  is 
no  need  to  replicate  the  data.  A  bigger  problem  can  be  solved.  We  partition  matrix  A  into  N  row 
blocks  and  B  into  N  column  blocks  (See  Figure  6(b)).  By  assigning  each  pair  of  submatrices,  Ai  and 
Bi,  to  one  processor  initially,  all  main  diagonal  blocks  of  C  can  be  computed.  Then,  the  row  blocks 
of  A  are  rotated  from  one  processor  to  another  after  each  row-column  submatrix  multiplication. 
With N processors, $N - 1$ rotations are needed to finish the computation, as shown in Figure 7 for
the  case  of  N  =  4.  This  method  is  referred  to  as  global  computation. 
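The rotation scheme can be simulated sequentially as a sanity check. The sketch below is an illustration under stated assumptions rather than the authors' implementation: processors are modelled as list indices, the rotation direction (each processor receives the row block of its right-hand neighbour) is an arbitrary choice, and N is assumed to divide both the row count of A and the column count of B.

```python
def matmul(X, Y):
    """Plain triple-loop product of two list-of-lists matrices."""
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def global_computation(A, B, N):
    """Simulate N processors; N must divide rows of A and columns of B."""
    l, m = len(A), len(B[0])
    rb, cb = l // N, m // N                  # block height and block width
    a_blocks = [A[p * rb:(p + 1) * rb] for p in range(N)]
    b_blocks = [[row[p * cb:(p + 1) * cb] for row in B] for p in range(N)]
    C = [[0] * m for _ in range(l)]
    held = list(range(N))     # held[p]: index of the A row block on processor p
    for step in range(N):     # N multiplication steps, N - 1 rotations
        for p in range(N):
            i = held[p]
            block = matmul(a_blocks[i], b_blocks[p])   # block (i, p) of C
            for r in range(rb):
                C[i * rb + r][p * cb:(p + 1) * cb] = block[r]
        if step < N - 1:
            held = held[1:] + held[:1]       # rotate the row blocks of A
    return C
```

The first step computes the main diagonal blocks of C, and each later step fills one further block diagonal, so every block is produced exactly once.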

For the global computation approach, the maximum scaled speedup is achieved when
$l = k = m = n\sqrt{N}$ (see Appendix):

$$S_N^*(W^*) = \frac{W_1 + N^{3/2} W_N}{W_1 + N^{1/2} W_N}. \tag{22}$$

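The corresponding function is $G(N) = N^{3/2}$. Assuming $N < n^2$, we can write $W_N$ as a function of M
as follows.

$$W_N = g(M) = 2\left(\frac{M}{3}\right)^{3/2}. \tag{23}$$

Increasing the total memory capacity to NM, we have

$$W_N^* = g(NM) = 2\left(\frac{NM}{3}\right)^{3/2} = N^{3/2} W_N = \bar{g}(N) W_N. \tag{24}$$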


The matrix multiplication problem has a semihomomorphism between its memory requirement and
computation requirement, with $\bar{g}(N) = N^{3/2}$. Assuming a negligible communication cost, the global
computation approach achieves the best possible scaled speedup for the matrix multiplication
problem.
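The scaling behaviour of the memory-work relation can be checked numerically. The sketch below assumes only the relations $M = 3n^2$ and $W_N = 2n^3$ from the text; the function names and the sample values n = 10 and N = 4 are illustrative.

```python
import math


def g(M):
    """Eq. (23): parallel work of the largest product that fits in memory M."""
    return 2.0 * (M / 3.0) ** 1.5


def global_speedup(W1, WN, N):
    """Eq. (22): scaled speedup of global computation, communication ignored."""
    return (W1 + N ** 1.5 * WN) / (W1 + N ** 0.5 * WN)


# Hypothetical sample sizes, chosen only for illustration.
n, N = 10, 4
M = 3 * n * n
ratio = g(N * M) / g(M)      # growth of the work when memory grows N-fold
```

Here `ratio` agrees with `N ** 1.5` up to rounding, and `global_speedup` reduces to N when the sequential part `W1` is zero.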

Figure 7. Matrix multiplication without data replication. Panels: (a) step 1, (b) step 2, (c) step 3, (d) step 4.

We have studied two extreme cases of memory-bounded scaled speedup, which are based on
global computation and local computation. In general, for most algorithms, part of the data
may be replicated and part of the data may have to be shared. Deriving a speedup formulation for
these algorithms is difficult, not only because the situation is more complicated, but also
because the ratio between replicated and shared data is uncertain. The replicated part may not
increase as the system size is increased. If the replicated part does increase, its rate of increase
may differ from that of the shared part. Also, an algorithm may start with
global computation. When the system size is increased, replication may be needed as part of the
effort  to  reduce  communication  overhead.  A  special  combined  case,  G{N)  =  (1  +  ^[1  —  — 
has  been  carefully  studied  in  [15].  The  structure  of  that  study  can  be  used  as  a  guideline  for  other 
algorithms. 

The influence of communication overhead on the best performance of the memory-bounded
speedup has been studied. The study can be extended to fixed-time speedup, where redundant
computation could be introduced to reduce the communication overhead. The function G(N) determines
the actual achieved speedup. We have shown how the partition and scale of the problem
influence the function G(N). In general, finding an optimal function G(N) is a non-linear optimization
problem. The concept of the function G(N) can be extended to algorithms with multi-degree of
parallelism.

6  Conclusions 

It is known that the performance of parallel processing is influenced by the inherent parallelism
and communication requirement of the algorithm, by the computation and communication power
of the underlying architecture, and by the memory capacity of the parallel computer system.
However, how these factors are related to each other, and how they influence the performance of
parallel processing, is generally unknown. Discovering the answers to these unknowns is important
for designing efficient parallel algorithms. In this paper one model of speedup, memory-bounded
speedup, is carefully studied. The model contains these factors as its parameters.

As  part  of  the  study  on  performance,  two  other  models  of  speedup  have  also  been  studied. 
They  are  fixed-size  speedup  and  fixed-time  speedup.  Two  sets  of  speedup  formulations  have  been 
derived  for  these  two  models  of  speedup  and  for  memory-bounded  speedup.  Formulations  in  the 
first  set  give  rise  to  generalized  speedup  formulas.  The  second  set  of  formulations  only  considers  a 
special,  simplified  case.  The  simplified  fixed-size  speedup  is  Amdahl’s  law,  the  simplified  fixed-time 
speedup  is  Gustafson’s  scaled  speedup,  and  the  simplified  memory-bounded  speedup  contains  both 
Amdahl’s  law  and  Gustafson’s  scaled  speedup  as  special  cases. 

The three models of speedup, fixed-size speedup, fixed-time speedup and memory-bounded
speedup, are based on different viewpoints and are suitable for different classes of algorithms.
However, algorithms exist which do not fit any of the models of speedup, but satisfy some combination
of the models.


Appendix 

When communication does not occur (local computation) or its cost is negligible, the
memory-bounded speedup equation (21) becomes

$$S_N^* = \frac{W_1 + G(N) W_N}{W_1 + \frac{G(N)}{N} W_N}.$$

It is easy to verify that $S_N^*$ increases with the function G(N). Thus, for the two extreme cases
considered in Section 5, the problem of how to reach the maximum speedup becomes how to scale
the matrices A and B such that the function G(N) reaches its maximum value. The matrices A and
B can be scaled in any dimension. A general scaled matrix multiplication problem is

$$A_{l \times k} B_{k \times m} = C_{l \times m},$$

where both A and B are rectangular matrices. To achieve an optimal speedup, we need to decide
the integers l, k, and m for which the function G(N) reaches its maximum value. The
following result gives the optimal l, k, and m for the global computation approach (Fig. 6(b))
given in Section 5. Recall that N is the number of processors.
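The monotonicity claim made earlier in this appendix can be checked in one step. As a sketch, treat G as a continuous variable (an assumption made here only for the check) and write $S(G) = \frac{W_1 + G W_N}{W_1 + \frac{G}{N} W_N}$ for the simplified memory-bounded speedup. Then

$$\frac{dS}{dG} = \frac{W_N \left(W_1 + \frac{G}{N} W_N\right) - \frac{W_N}{N} \left(W_1 + G W_N\right)}{\left(W_1 + \frac{G}{N} W_N\right)^2} = \frac{W_1 W_N \left(1 - \frac{1}{N}\right)}{\left(W_1 + \frac{G}{N} W_N\right)^2} \ge 0.$$

The derivative is strictly positive when $W_1 > 0$, $W_N > 0$ and $N > 1$, so a larger G(N) yields a larger speedup. When $W_1 = 0$ the speedup equals N regardless of G(N).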

Proposition 1 If A and B are $n \times n$ matrices when N = 1, then the global computation approach
reaches the maximum G(N) when $l = k = m = n\sqrt{N}$, excluding the communication cost.
The corresponding G(N) equals $N^{3/2}$ and the maximum speedup is

$$S_N^* = \frac{W_1 + N^{3/2} W_N}{W_1 + N^{1/2} W_N}. \tag{26}$$

Proof: By the partition schema of the global computation approach, the rows of matrix A
and the columns of matrix B are distributed among processors. Each processor therefore works on an
$\frac{l}{N} \times k$ row block of A and a $k \times \frac{m}{N}$ column block of B at a time.
Since the memory is fully filled,

$$\frac{l}{N} \cdot k + k \cdot \frac{m}{N} + \frac{l}{N} \cdot m = 3n^2.$$

Thus,

$$k = \frac{3n^2 - \frac{l}{N} \cdot m}{\frac{l}{N} + \frac{m}{N}} = \frac{3n^2 N - l \cdot m}{l + m}. \tag{27}$$


The work of the scaled problem is

$$W_N^* = 2 \cdot l \cdot m \cdot k = 2 \cdot l \cdot m \cdot \frac{3n^2 N - l \cdot m}{l + m} = 2n^3 \cdot \frac{(3n^2 N - l \cdot m)(l \cdot m)}{(l + m) \cdot n^3} = W_N \cdot \frac{(3n^2 N - l \cdot m)(l \cdot m)}{(l + m) \cdot n^3},$$

so that

$$G(N) = \frac{(3n^2 N - l \cdot m)(l \cdot m)}{(l + m) \cdot n^3}. \tag{28}$$

Therefore, G(N) reaches its maximum value if and only if the function

$$f(l, m) = \frac{(3n^2 N - l \cdot m)(l \cdot m)}{l + m}$$

reaches its maximum value. At its maximum value, the partial derivatives of f(l, m) vanish.
Clearing the positive denominator $(l + m)^2$, this gives

$$f'_l = -l^2 m^2 - 2 l m^3 + 3n^2 m^2 N = 0,$$
$$f'_m = -l^2 m^2 - 2 m l^3 + 3n^2 l^2 N = 0.$$

It leads to

$$l^2 + 2lm - 3n^2 N = 0,$$
$$m^2 + 2lm - 3n^2 N = 0.$$

That is,

$$(l + m)^2 = m^2 + 3n^2 N,$$
$$(m + l)^2 = l^2 + 3n^2 N.$$

Thus, we have $m^2 = l^2$, i.e.

$$l = m. \tag{31}$$

Combining Eq. (31) and Eq. (29), we get

$$l = m = n\sqrt{N}.$$

From Eq. (27), we have $k = n\sqrt{N}$. Thus, the enlarged A and B are still square matrices, with
dimension $n\sqrt{N}$. By Eq. (28) the maximum G(N) is

$$\frac{(n\sqrt{N})^2 \left(3n^2 N - (n\sqrt{N})^2\right)}{n^3 (2n\sqrt{N})} = N^{3/2},$$

which is equal to the memory-work function $\bar{g}(N)$ for the matrix multiplication problem (see Section
5), and the corresponding speedup is

$$S_N^* = \frac{W_1 + N^{3/2} W_N}{W_1 + N^{1/2} W_N}.$$

From Theorem 1, it is the best possible performance for the matrix multiplication problem. □
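The optimum in Proposition 1 can be cross-checked by brute force. The sketch below is an illustration only: it evaluates the expression for G(N) derived in the proof, $(3n^2 N - lm)(lm) / ((l + m) n^3)$, over all integer pairs (l, m) that keep k positive. The function names and the sample values n = 4 and N = 4 are assumptions made for the example.

```python
import math


def k_from_memory(l, m, n, N):
    """Eq. (27): the inner dimension k that exactly fills the memory."""
    return (3 * n * n * N - l * m) / (l + m)


def G_global(l, m, n, N):
    """Scaled work divided by the unscaled work 2n^3, with k eliminated."""
    return (3 * n * n * N - l * m) * (l * m) / ((l + m) * n ** 3)


def best_dimensions(n, N):
    """Brute-force search over integer (l, m) that keep k positive."""
    best = (0.0, 0, 0)
    bound = 3 * n * n * N             # l * m must stay below this for k > 0
    for l in range(1, bound):
        for m in range(1, bound // l + 1):
            if l * m < bound:
                value = G_global(l, m, n, N)
                if value > best[0]:
                    best = (value, l, m)
    return best


# Hypothetical sample sizes, chosen only for illustration.
n, N = 4, 4
value, l, m = best_dimensions(n, N)
k = k_from_memory(l, m, n, N)
```

For n = 4 and N = 4 the search returns l = m = 8, which is $n\sqrt{N}$, with k = 8 and G = 8 = $N^{3/2}$, in agreement with the proposition.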

Using arguments similar to those in Proposition 1, we can find that the optimal dimensions for the local
computation approach are $l = k = n$, $m = nN$, and the maximum value of G(N) is N (see Section
5). The scalability of matrices A and B is application dependent. If A and B must be maintained as
square matrices, the following proposition shows the limitation of the local computation approach.

Proposition 2 If A and B are $n \times n$ matrices when N = 1, and $l = k = m$ is required, then the
maximum value of G(N) of the local computation approach is $\left(\frac{3N}{N+2}\right)^{3/2}$, which is bounded by $3^{3/2}$
and is smaller than N, for N > 1.

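Proof: When A and B are square matrices, the scaled problem is

$$A_{k \times k} B_{k \times k} = C_{k \times k}.$$

If the load is balanced on each processor, and $m = \frac{k}{N}$ is an integer, then each processor does the
work

$$A_{k \times k} B_{k \times m} = C_{k \times m}.$$

When memory is fully used,

$$k^2 + 2k \cdot m = 3n^2.$$

Since $m = \frac{k}{N}$,

$$k^2 + \frac{2k^2}{N} = 3n^2.$$

Thus,

$$k = \sqrt{\frac{3n^2}{1 + \frac{2}{N}}} = \sqrt{\frac{3N}{N+2}} \, n.$$

The scaled work

$$W_N^* = 2k^3 = \left(\frac{3N}{N+2}\right)^{3/2} 2n^3 = \left(\frac{3N}{N+2}\right)^{3/2} W_N,$$

and

$$G(N) = \left(\frac{3N}{N+2}\right)^{3/2}.$$

Since

$$\frac{3N}{N+2} = \frac{3N + 6}{N+2} - \frac{6}{N+2} = 3 - \frac{6}{N+2}$$

and

$$\frac{3N}{N+2} = \frac{3}{N+2} \times N,$$

the G(N) is bounded by $3^{3/2}$ and is smaller than N, for N > 1. The first identity gives the bound
directly. For the second claim, $G(N) < N$ is equivalent to $3N^{1/3} < N + 2$, which holds for N > 1 by
the arithmetic-geometric mean inequality applied to N, 1, and 1.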
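The bounds in Proposition 2 are easy to confirm numerically. The sketch below is illustrative only: the function names are assumptions, and the check covers a finite range of N rather than proving the inequality.

```python
import math


def G_square_local(N):
    """G(N) for local computation when the scaled matrices stay square."""
    return (3.0 * N / (N + 2.0)) ** 1.5


def scaled_side(n, N):
    """Side length k of the scaled square matrices that fills memory 3n^2."""
    return math.sqrt(3.0 * N / (N + 2.0)) * n


# Check both bounds for a finite range of processor counts.
bounds_hold = all(G_square_local(N) < min(N, 3.0 ** 1.5)
                  for N in range(2, 10001))

# Hypothetical sample sizes, chosen only for illustration.
n, N = 10, 4
k = scaled_side(n, N)
memory_used = k * k + 2 * k * (k / N)   # replicated A plus blocks of B and C
```

`memory_used` matches $3n^2$ up to rounding, and `G_square_local(N)` stays below both N and $3^{3/2} \approx 5.196$ over the tested range, so the square restriction caps the achievable scaling well below the N obtained when B and C may grow along one dimension.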